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1. Introduction 

We consider the problem of graph estimation in Gaussian graphical models. The 
graph of a random vector (X\, . . . , X p ) represents the conditional dependences 
between the variables X\,.,. ,X p . More precisely, assume that (Xi, . . . ,X P ) is 
a centered Gaussian vector with covariancc matrix S. Then, the law Ps of 
(X±, . . . ,X p ) is a graphical model according to a graph G, if for any nodes a 
and b that are not neighbours in G, the variables X a and Xj, are independent 
conditionally on the remaining variables. There exists a unique graph Gs which 
is minimal for the inclusion and such that Ps is a graphical model according 
to G-£. Our aim is to estimate this graph from a n-sample of Pg. We will pay 
a special attention to the case where n < p. In what follows, we shall always 
assume that £ is non-singular. 

The problem of graph estimation in Gaussian graphical model when the sam- 
ple size n is smaller (or much smaller) than the number p of variables is moti- 
vated by applications in post-genomic and is a current active field of research in 
statistics. Biotcchnological developments in proteomics or transcriptomics en- 
able to produce a huge amount of data. One of the challenges for the statistician 
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is to infer from these data the regulation network of a family of genes (or pro- 
teins) . The task is difficult due to the very high-dimensional nature of the data 
and the small sample size. For example, microarrays measure the expression 
levels of a few thousand genes and the sample size n is no more than a few 
tens. The Gaussian graphical modeling has been proposed as a tool to handle 
this issue, see e.g. the papers of Kishino and Waddell [14], Dobra et al [7], Wu 
and Ye [24] . The gene expression levels are modeled by a Gaussian law IPs and 
the regulation network of the genes is then depicted by the graph Gs of the 
conditional dependences. 

Many estimation procedures have been proposed recently to perform graph 
estimation in Gaussian graphical model when n < p. A first class of procedures is 
based on multiple testing on empirical partial covariancc. Actually, if Gs denotes 
the (minimal) graph of the law Ps, there is an edge in Gs between a and b, if and 
only if the conditional covariance of X a and Xt, given all the other variables is 
non-zero. When n < p, the empirical version of the latter conditional covariancc 
cannot be computed, so several papers suggest to use instead the empirical 
conditional covariance of X a and X b given {X S1 seS} for some subsets S of 
{1, . . . , p} \ {a, b} with cardinality less than n — 2. A multiple testing procedure 
is then applied to detect if the conditional covariance cov(X a , Xb\X s , s £ S) 
is non-zero. Wille and Biihlmann [23] restrict to the sets S of cardinality less 
than one, Castelo and Roverato consider the sets S with cardinality at most q 
(for some fixed q) and Spirtes et al. [20] (see also Kalisch and Biihlmann [13]) 
propose a procedure which avoid an exhaustive search over all S. A second class 
of procedures relies on the fact that the entries f2 ai b of the inverse covariance 
matrix £1 = S^ 1 are non-zero if and only if there is an edge between a and b in 
Gs- Several papers then suggest to perform a sparse estimation of Q in order 
to estimate the graph Gs, sec Huang et al. [12], Yuan and Lin [25], Banerjee 
et al. [1], Friedman et al. [10], and Fan et al. [9]. They propose to maximise 
the log- likelihood of under I 1 constraints to enforce sparsity and they design 
optimisation algorithms to perform this maximisation. Finally, a third class of 
procedures uses the fact that the coefficients 6 a j, of the regression of X a on 
{Xb, b a} are non-zeros if and only if there is an edge between a and b in Gs. 
Meinshauscn and Biihlmann [17] or Rocha et al. [18] perform regressions with I 1 
constraints, whereas Giraud [11] (see also Verzelen [21]) proposes an exhaustive 
search to obtain a sparse estimate of the matrix 9 and then detect the graph 
G E . 

In this paper, we propose a new estimation scheme which combines the good 
properties of these different procedures. Actually, the procedures based on the 
empirical covariance or on I rcgularisation share some nice computational prop- 
erties and they can handle several hundred of variables X±, . . . , X p . Nevertheless, 
the theoretical results assessing their statistical accuracy are either of asymp- 
totic nature or rely on strong assumptions on the covariance [15, 19]. Moreover, 
their performance heavily depends on one (or several) tuning parameter, which 
is usually not dimcnsionless and whose optimal value is unknown. On the other 
hand, the exhaustive search of [11] has a good statistical accuracy and strong 
theoretical results have been established, but the computational complexity of 
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the exhaustive search is huge and it cannot be performed when the number 
p of variables is larger than a few tens. Our strategy here is to build a data- 
driven family of candidates graphs with several fast above-mentioned procedures 
and then to apply the selection criterion presented in [11] to select one graph 
among them. The resulting estimation procedure can handle several hundred 
of variables Xi, . . . ,X P and presents good statistical properties. Actually, the 
procedure is shown to be consistent in a high-dimensional setting and its risk is 
controlled by a non-asymptotic oracle-like inequality. The assumptions needed 
to establish these results are weaker than those commonly used in the literature. 
In addition, numerical experiments show a nice behavior on simulated exam- 
ples. The procedure is implemented in the i?-packagc GGMselect available on 
http : //w3 . j ouy . inra . f r/unites/miaj /public/logiciels/GGMselect/ 

The remaining of the paper is organized as follows. We describe the estimation 
procedure in the next section and state some theoretical results on its statistical 
accuracy in Section 3. In Section 4, we carry out some numerical experiments 
and Section 5 is devoted to the proofs. 

Notations. To estimate the graph Gs, we will start from a n-sample X^\ . . . , X^ 
of the law Ps. We denote by X the n x p matrix whose rows are given by the 
vectors X^\ namely Xj i(1 = Xa^ for i = 1, . . . , n and a = 1, . . . , p. We write X Q 
for the a th column of X. We also set T = {1, . . . ,p} and for any graph G with 
nodes indexed by T, we write d a (G) for the degree of the node a in the graph 
G (which is the number of edges incident to a) and deg(G) = max ae r d a (G) for 

Q 

the degree of G. Moreover, the notation a ~ b means that the nodes a and b 
are neighbours in the graph G. Finally, we write O for the set of p x p matrices 
with on the diagonal, || ■ || gX p for the Frobenius norm on q x p matrices 

\\Af qXp = Tr{A T A) =J2J2 A h > 

i=l 3=1 

|| ■ || n for the Euclidean norm on M. n divided by s/n, and for any /3 € W we 
defined supp(/3) as the set of the labels a g T such that (3 a ^ 0. 

2. Estimation procedure 

GGMselect is a two-stage estimation procedure which first builds a data-driven 
family Q of candidate graphs and then applies a selection procedure to pick one 
graph among these. We present the selection procedure in the next paragraph 
and then describe different possible choices for the family of candidate graphs 

g. 

2.1. Selection procedure 

We assume here that we have at hand a family Q of candidate graphs, which all 
have a degree smaller than n — 2. To select a graph G among the family g, we 
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use the selection criterion introduced in [11]. We briefly present this criterion 
here and refer to [11] for further details. We write 9 for the p x p matrix such 
that 

E s [X a \X b , b^a]=J2 ^bX b and 9 a , a = for all a G {1, . . . ,p} . 

The matrix 9 minimizes ||S 1//2 (7 — 6>')|| pxp over the set of p x p matrices 9' 
with on the diagonal. Since X T X/n is an empirical version of S, an empirical 
version of ||S 1 / 2 (I — 9)\\ pX p is ||X(I — 9)\\ nxp divided by ^fn. Therefore, for any 
graph G in Q 7 we associate an estimator 9q of 9 by setting 

9 G = argmm{||X(7 - 6')\\ nXp : 9' e 9 G } , (1) 

where Qq is the set of p x p matrices 9' such that 9' a b is non-zero if and only if 
there is an edge between a and b in G. 

Finally, we select a graph G in Q by taking any minimizer over Q of the 
criterion 



v r 



Crit(G) = 



a=l 



\X a -X[9 G ] a \\i 1 



pcn[d a (G)] 
n-d a (G) 



(2) 



where d a (G) is the degree of the node a in the graph G and the penalty function 
pen : N — > M. + is of the form of the penalties introduced in Baraud et al. [2] . To 
compute this penalty, we define for any integers d and TV the DKhi function by 

/ x \ x ( N +2 \ 

DKhi(d, N,x)=¥[ F d+2 , N > — — F d , N+2 > -^rrx ),x>0, 



d+2J d \ " ,J,T " - Nd 

where F^n denotes a Fisher random variable with d and N degrees of freedom. 
The function x DKhi(d, N, x) is decreasing and we write EDKhi[d, N, x] for 
its inverse, see [2] Sect. 6.1 for more details. Then, we fix some constant K > 1 
and set 



fl — A 

pen(d) = K EDKhi 

n — d — 1 



1 ( P d 1 ' 



(3) 



When d remains small compared to n, the penalty function increases approx 

v 2 



imately linearly with d. Actually, when d < yn/ [2 (1.1 + yTogp) ) for some 
7 < 1, we approximately have for large values of p and n 



pen(d) < K(l + e*yj2\ogp) (d + 1 



2 



see Proposition 4 in Baraud et al. [2] for an exact bound. 

The selection procedure depends on a dimcnsionlcss tuning parameter K. 
A larger value for K yields a procedure more conservative. In theory (and in 
practice) K has to be larger than one. We propose to set K between 2 and 3 
depending on the desired level of false discovery rate of edges. 
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2.2. Family Q of candidate graphs 



The computational complexity of the minimization of the criterion (2) over the 
family Q is linear with respect to its size. In particular, minimizing (2) over 
all the graphs with degree smaller than some integer D, as proposed in [11], is 
untractable when p is larger than a few tens. To overcome this issue, we propose 
to build a much smaller (data-driven) family Q of candidate graphs, with the 
help of various fast algorithms dedicated to graph estimation. 

We present below four way to build a family of candidate graphs, chosen on 
the basis of theoretical results and simulation studies. They will be denoted by 
Qqe, <7coi, <?la, and Gew- Even if the procedure applies for any family Q, we 
advise to choose in practice one of these four families or the union of them. 

In the following, we describe these four families, provide algorithms to com- 
pute them efficiently, and discuss their computational complexity and their size. 
Each family depends on an integer D, smaller than n — 2, which corresponds to 
the maximal degree of the graphs in this family. 



Quasi-exhaustive family Qqe 



Description. Roughly, the idea is to break down the minimization of the crite- 
rion (2) over all the graphs of degree at most D into p independent problems. 
For each node a G T, wc estimate the neighborhood of a by 

he(a) = argmin |||X - Pvo] Vs (X a )\\l (l + ^§f ) ■ S C T \ {a} and \S\ < 

where pen is the penalty function (3) and Projy denotes the orthogonal pro- 
jection from E™ onto V s = {X/3 : (3 G W and supp(/3) = S}. We know from 
Verzelen [21] that ne(a) is a good estimator of the true neighborhood of a, from 
a non-asymptotic point of view. We then build two nested graphs Gif, an d and 
Gk,ot as in Meinshausen and Biihlmann [17] 

a i kj md b <=> a G hc(6) and b e ne(a) , 
a b <^=> a G ne(6) or b G nc(a) , 
and define the family Qqe as the collection of all the graphs that lie between 

G/f,and and Gk,ot 

Qqe = {G, Gk.^a C G C G K .or and dcg(G) < £>} . 

It is likely that the graph G ex haustive which minimizes (2) over all the graphs of 
degree at most D belongs to the family Qqe- In such a case, the minimizer Gqe 
of the criterion (2) over Qqe coincides with the estimator G ex haustive of [11]. 
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QE Algorithm 

1. Compute ne(a) for all a G T. 

2. Compute the graphs Gk, and and Gx.m- 

3. Work out the family £/qe- 

Complexity. The complexity of the computation of the collections he(a) is much 
smaller than the complexity of the computation of Gexhaustivc- Nevertheless, it 
still remains of order np D+1 D 3 and the size of the family C/qe can be of order 
2P D / 2 i n the worst cases. However, for sparse graphs G%, the graphs Gk, and 
and Gk,oi are quite similar in practice, which makes the size of Gqe much 
smaller. The procedure then remains tractable for p and D reasonably small. 
Computational times for some examples are given in Section 4.1. 

C01 family £coi 

Description. The family Qcoi derives from the estimation procedure proposed 
in Willc and Biihlmann [23] and is based on the 0-1 conditional independence 
graph Goi- This graph is defined as follows. For each pair of nodes (a, b), we 
write i? O; h|0 for the correlation between the variable X a and Xt> and i? ai b| c for the 
correlation of X a and X}, conditionally on X c . Then, there is an edge between 
a and b in Goi, if and only if ^ and R a ,b\c for all c € T \ {a, 6}, viz 

0^6 ^ min{|i? Qjb | c |, ce {0}ur\{a,6}} >0 . (4) 

Although the 0-1 conditional independence graph Goi does not usually coincide 
with the graph Gs, there is a close connection between both graphs in some 
cases (see [23]). Willc and Biihlmann [23] propose then to estimate the graph 
Goi in order to get an approximation of Gs • The following construction of the 
family Qoi derives from their estimation procedure. We write P(a,b\c) for the 
p- value of the likelihood ratio test of the hypothesis "R a ,b\c = 0" and set 

PmaxM) =max{P(a,6|c), c € {0} U T \ {a, b}} . 

For any a > 0, the graph Goi, Q is defined by 

a b Pmax(a, b) < a 

and the family Gcoi is the family of nested graphs 

<?coi = jGoi,a, a > and deg(Goi, Q ) < £>j . 

C01 Algorithm 

1. Compute the p(p — l)/2 values P m ax(ci, b). 

2. Order them. 

3. Extract from these values the nested graphs |Goi. q : a > o|. 

4. Stop when the degree becomes larger than D. 
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Complexity. The algorithm goes very fast since its complexity is of order np 3 . 
The size of the family Gcoi is smaller than pD. 

Lasso- And family C/la 

Description. The Lasso- And family £/la derives from the estimation procedure 
proposed by Meinshausen and Buhlmann [17] and is based on the LARS-lasso 
algorithm of Hastic et al. [8]. For any A > 0, we define the p x p matrix 9 X by 

A = Mgmin{||X-X0'||£ xp + A||0'||i : & e 6} , (5) 

where is the set of p x p matrices with on the diagonal and \\9'\\i = 
J2 a ^b Wab\- Then, we define the graph G A nd by setting an edge between a and 
b if both 0^ b and 8^ a arc non-zero, namely 

This graph G A nd is exactly the estimator (7) introduced by Meinshausen and 
Buhlmann [17]. We note that the size of G and has a tendency to increase when 
the tuning parameter A decreases. Hence, we define the family C?la &s the set of 
graphs G and with A large enough to ensure that deg(G and ) < D, viz 

Gla = , A > A and ,_D j , where A and ,D = sup |a, deg(G and ) > D j . 

From a computational point of view, the family (?la can be efficiently com- 
puted with the LARS-lasso algorithm. The optimization problem (5) is broken 
into the p independent minimization problems 

6% = argmin {||X a - Xvj| 2 + A||i>||i : nef and v a = 0} , for any a G I\ (6) 

with \\v\\ i = Ylb=i \ Vb \- When A decreases, the support of 9 X is piecewise constant 
and the LARS-lasso algorithm provides the sequences (A„)j>i of the values of 
A where the support of 6* changes, as well as the sequence of the supports 
^supp(# A »)^ . We work out the family (?la by gathering these p sequences as 
described below. 

LA Algorithm 

1. Compute with LARS-lasso the ^A^, supp(6* A " )j for all a G T. 

2. Order the sequence {X l a : a E T, I > l}. 

3. Compute G a ° d for all A^ > A an d,£>- 

Complexity. The complexity of the LARS-lasso algorithm is unknown in general. 
Nevertheless, according to Hastie et al. [8] the algorithm requires 0(np(n Ap)) 
operations in most cases. Hence, the whole complexity of the LA algorithm is 
generally of the order p 2 n(n A p). Finally, the size of the family C/la cannot be 
bounded uniformly, but it remains smaller than pD in practice. 
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Adaptive lasso family Gew 

Description. The family Gew is a modified version of Gla inspired by the 
adaptive lasso of Zou [26]. The major difference between Gew and Gla lies 
in the replacement of the I 1 norm \\9'\\i in (5) by ||0'/0 mit ||i, where 9 nut is 
a preliminary estimator of 9 and 9' /9 lnlt stands for the matrix with entries 
(9 / /9 lnit ) a , b = 6' a b /6^. Zou suggests to take for 9 ,nit a ridge estimator. Here, 

we propose to use instead the Exponential Weights estimator 9 EW of Dalalyan 
and Tsybakov [5, 6]. The choice of this estimator appears more natural to us 
since it is designed for the sparse setting and enjoys nice theoretical properties. 
Moreover, we have observed on some simulated examples, that the adaptive 
lasso with the Exponential Weights initial estimator performs much better than 
the adaptive lasso with the ridge initial estimator. 

To build the family Gew we thus start by computing the Exponential Weight 
estimator 9 EW . For each a 6 T, we set H a = {v e W : v a — 0} and 

E a W = ! ve-'\\*--*»*l J] (1 + («i/r)T a ^ , (7) 

JH a ■ *a 

with Z a = f Ha e -0\\^a-xv\\l jj. (i + (vj/rfY" dv and a,/3,r > 0. We note 
that 9 EW with (3 = n/(2a^) and a 2 a = var(X a | X- a ) is simply the Bayesian 
estimator of 9 a with prior distribution dn(v) cx FJ . (l + {vj/r) 2 ) ° dv on H a . In 
the Gaussian stetting, Dalalyan and Tsybakov [5] give a sharp and assumption- 
free sparse inequality for 9 EW with (5 < n/(4a^), see Corollary 4 in [5]. 

The construction of Gew is now similar to the construction of Gla- For any 
A > we set 

qE\v,\ = argmin ||| X - X6'f nxp + X\\9'/e EW \\ 1 :«'ee), (8) 
and we define the graph Gqj W ' A by setting an edge between a and b if either 

nBW.X qEW.X ■ 

9b a 21 " a b 1S non-zero: 

G or W,A , 2EW,A / n 2EW.A / n 

a ~ b 9 a b £ or 9 b a ^ . 

Finally, the family Gew is given by 
Gew = {G™'\ A > A^} , where A™ = sup {a, dcg(G™' A ) > d) . 

The Exponential Weight estimator 9 EW can be computed with a Langevin 
Monte-Carlo algorithm. We refer to Dalalyan and Tsybakov [6] for the details. 
Once 9 EW is computed, the family Gew is obtained as before with the help of 
the LARS-lasso algorithm. 
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EW Algorithm 

1. Compute 9 EW with a Langcvin Monte-Carlo algorithm. 

2. Compute with LARS-lasso the (a„, supp(# A »)^ for all a e T. 

3. Order the sequence {A^ : a e T, / > l}. 

4. Compute Goi W,Aa for all > A or ,,D- 

Complexity. The complexity of the first step depends on the choices of the tuning 
parameters. Some examples are given in Section 4.1. The complexity of steps 2, 
3 and 4 is the same as for the LA-algorithm, of the order p 2 n(nAp) in practice. 
Finally, as for (Jla, we do not know a general bound for the size of Gew, but it 
remains smaller than pD in practice. 



3. Theoretical results 

We present in this section some theoretical results which assess the performance 
of our selection procedure. We present two kind of results: a non-asymptotic 
oracle-like inequality concerning the estimation of 9 and a consistency result for 
the estimation of G-£. 



3.1. A non- asymptotic oracle-like inequality 

Next theorem states an oracle- like inequality in the spirit of Theorem 1 in [11]. 
We associate to the graph G selected by the procedure of Section 2, the estimator 
9 = 9~ of the matrix 9, where 9q is given by (1) for any graph G e Q. The 

quality of the estimation of 9 is quantified by the MSEP of 9 defined by 



MSEP(<?) =E \\T} /2 {9- 9)\\ 2 p 



X p 



We refer to the introduction of Giraud [11] for a discussion on the relevance of 
the use of the MSEP of 9 to assess the quality of the estimator G. In the sequel, 
/ stands for the identity matrix of size p. 

Roughly speaking, next theorem ensures that the estimator 9 performs almost 

as well as the best estimator in the family j^c, G € (?|. 
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Theorem 3.1. Assume that n > 9. Let Q be any (data-driven) family of graphs 
with maximal degree D^— max{deg(G), G S G} fulfilling 



1<D$< 7 , for some -y<l. (9) 



Then, the MSEP of the estimator 9 is upper bounded by 



MSEP(0) < Ljc, 7 Iog(p) 



inf (msep(? g ; 

Gee v 



v MSEP 1£ ) \ 



n 



where Lk,^ is a positive constant depending on K and 7 only and the residual 
term R n = _/?„(£, 7) (made explicit in the proof) is of order n 3 ir(S)e^™^^~ 7 '' / 4 . 

If we forget the term n _1 MSEP(7) in (10), Theorem 3.1 states that under 
Condition (9) the MSEP of 9 nearly achieves, up to a log(p) factor, the average 
minimal MSEP of the family of estimators {Oq, G <G G}- Hence, 9 performs 
almost as well as the oracle up to a \ogp factor. This logarithmic factor is 
proved to be unavoidable from a minimax point of view (see [21] Sect. 4.2). 

Let us now compare the additional term ti~ 1 MSEP(7) appearing in (10) 
with the risk MSEP(#g). This additional term is equal to n^ 1 °~ai where a\ 
stands for the conditional variance of X a given the remaining variables. Hence, 
this quantity is usually smaller than the risk MSEP(#g) which is the sum of a 
bias term and a variance term of order n^ 1 ^a{G)o-^. Nevertheless, when the 
true graph G*s is empty and G contains the empty graph, the additional term 
n _1 MSEP(7) is dominant and the estimator 9 is not optimal. Such a drawback 
is actually unavoidable in model selection when the target is too close to zero 
(sec Birgc and Massart [4] Sect. 2. 3. 3 for a discussion). 

Finally, we can compare the MSEP of 9 to the MSEP of 9q s when G^ belongs 
to G with large probability. Roughly speaking, the MSEP of 9 is in this case 
smaller (up to a logp factor) than the MSEP of 9g s - This means that 9 performs 
almost as well as if wc knew the true graph Gs in advance . 

Corollary 3.2. Under the assumption of the above theorem, if the minimal 
graph Gs belongs to the family G with large probability 

P (G s g G) > 1 - L(a) cxp{-[3n s ), for some 01,(3, 8 > (11) 

then, the MSEP of the estimator 9 is upper bounded by 

MSEP(0) < L Kn log(p) (MSEP(0 Gs ) V MS ^ P(/) ) + R n . (12) 
where the residual term R n — i?„(S, 7, a, (3, 5) is of order n 3 tr(S)[e~™^v / T 7- ')') / 4 -|- 

v/ZRe-!" 5 ]. 
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3.2. Consistency of the selection procedure 

The next theorem states, under mild assumptions, a consistency result for our 
selection procedure in a high dimensional setting. In the spirit of the results 
of Meinshausen and Biihlmann [17], we consider the case where the number of 
variables p increase with the sample size n. 

We make the following assumptions: 
(H.l) Pn >n. 



(H.2) dcg(G s J < - — A - — 5 — for somc s < 1 . 



n n 

, A 5- 

l0gp„ log p n 

(H.3) min 6 2 a b min ^fefej > n s ' - 1 for somc s' > s 

a^b, &Gno Gs (a) ' a^b Var(X b \X_ b ) 



Theorem 3.3. Assume that the family Q of candidate graphs contains the true 
graph with probability going to 1 and (H.l), (H.2), (H.3) are fulfilled. Then, 

the estimation procedure GGMselect with K > 



3V 



2.5 



U 

D p = max{deg(G), G G Q} < 



is consistent. More precisely, there exist some universal constant L and some 
integer no = no [K, s, s'] not depending on the true graph Ge„ nor on the co- 
variance £„ such that 



G — 



> 1 - Lp , 



-1/2 



for any n > no . 



Let us discuss the assumptions of the theorem and their similarity with 
some of the hypotheses made in Meinshausen and Biihlmann [17]. The Assump- 
tion (H.2) is met if p n grows polynomially with respect to n and the degree 
of the true graph does not grow faster than n K with n < s (which corresponds 
to Assumptions 1 and 2 in [17]). We mention that (H.2) is not satisfied when 
p n grows exponentially with n unless Gs n is empty. It is actually impossible to 
consistently estimate a non-empty graph if p n is of order exp(n), see Wainwright 
[22]. 

The Assumption (H.3) ensures that the conditional variances as well as the 
non-zero terms 9 a .b are large enough so that the edges can be detected. To 
compare with [17], Assumption (H.3) is met as soon as Assumption 2 and 5 
in [17] are satisfied. In addition, we underline that we make no assumption on 
the Zi-norm of the prediction coefficients or on the signs of 9 a ,b (Assumption 4 
and 6 in [17]). 

Finally, we do not claim that the condition K > [2.5/(1 — s) V 3] is minimal 
to obtain consistency. It seems from simulation experiments that smaller choices 
of K also provide good estimations. 
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4. Numerical study 

It is essential to investigate the performance of statistical procedures on data. 
Since we do not know the actual underlying graph of conditional dependences 
on real data sets, we opt for a numerical study with simulated data. Our aims 
in this study are to evaluate the feasibility of the GGMselect procedure and to 
compare its performances with those of recent graph-selection procedures. 

Simulating the data The matrix X is composed of n i.i.d. rows with Gaus- 
sian A/" p (0, f2 _1 ) distribution where the inverse covariance matrix fl is con- 
structed according to the following procedure. We set £1 = BB T + D, where 
B is a random sparse lower triangular matrix and D is a diagonal matrix with 
random entries of order 10~ 3 . The latter matrix D prevents from having too 
small eigenvalues. To generate B we split {1, . . . ,p} into three consecutive sets 
h, I3 of approximately equal size, and choose two real numbers 77i nt and 
Tjext between and 1. For any a, b such that 1 < a < b < p, we set B a \> = 
with probability 1 — ?yj nt if a and b are in the same set, and we set B a b = 
with probability 1 — ?7 cxt if a and b belong to two different sets. Then, the lower 
diagonal values that have not been set to are drawn according to a uniform 
law on [—1, 1] and the diagonal values arc drawn according to a uniform law on 
[0,e]. Finally, we rescale O in order to have 1 on the diagonal of S = Q . This 
matrix £ defines a graph G = Gs and a matrix 9 defined as in Section 2.1. The 
sparsity of the graph is measured via a sparsity index noted I s , defined as the 
average number of edges per nodes in the graph. 

In our simulation study we set 77 = rji n t = 5?7cxt, and e = 0.1. We evaluate the 
value of T) corresponding to a desired value of the sparsity index / s by simulation. 
Choosing I s small, we get sparse graphs whose edges distribution is not uniform, 
see Figure 1. 

GGMSelect: choice of graphs families Our procedure is applied for the 
families of graphs presented in Section 2.2. The methods are respectively denoted 
QE, C01, LA and EW. 

As it was already mentioned in Section 2.2, the size of the family Gqe may 
be very large leading to memory size problems in the computational process. 
In that case, as soon as a memory size problem is encountered, the research 
between Gk^tlA and Gk.w is stopped and prolonged by a stepwise procedure. 

The family Gew is based on the calculation of exponential weight estimators 
9 EW . This calculation depends on parameters, denoted a,/3,a,T in [6], that 
defined the aggregation procedure, and on parameters, denoted h and T in [6], 
used in the Langevin Monte-Carlo algorithm. We chose these parameters as 
follows. The matrix X being scaled such that the norm of each column equals 1, 
we took a = and we set a = 0, = 2/n, r = l/y/n(p— 1) and h = 10~ 3 , 

T = 200. Using these parameters values we did not encountered convergence 
problems in our simulation study. 

Our procedure depends on two tuning parameters: K occurring in the penalty 
function (see Equation 3) and D the maximum degree of the graph. We choose 
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Figure 1. One simulated graph G with p = 30 and I s = 3. The degree deg(G) of the graph 
equals 8. 

K = 2.5 in all simulation experiments. In practice, we would like to choose D 
as large as possible, and this can be done for all methods except QE where the 
complexity of the algorithm increases exponentially with D. 

All these methods are implemented in R-2.7.2 in the package GGMselect 
available on request. 

4.1. CPU times 

We assess the practical feasibility of the methods we propose from the point of 
view of the memory size and computer time. To this aim, we simulate graphs 
with p = 30, 100, 200, 300 nodes, sparsity I s = 3 and n = 50. The simulation 
were run on a Bi-Pro Xeon quad core 2.66 GHz with 24 Go RAM. The computer 
time being strongly dependent on the simulated graph we calculate the mean of 
computer times over Nq = 100 simulated graphs. For each of these graphs, one 
matrix X is simulated. The results are given in Table 1. The maximum degree 
D of the estimated graph was set to 5, except for the QE method where D = 3 
and 5. The maximum allowed memory size is exceeded for the QE method when 
D = 5 and p = 100, 200, and when D = 3 for p = 300. The LA and C01 methods 
are running very fast. The computing time for the EW method increases quickly 
with p: in this simulation study, it is roughly proportional to cxp (y^i/2) , see 
Figure 2. This order of magnitude is obviously dependent on the choice of the 
parameters occurring in the Langevin Monte-Carlo algorithm for calculating 

qEW 
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p 


QE D = 3 


QE D = 5 


EW D = 5 


LA D = 5 


COl D = 5 


30 


16 [1.9,1366] 


146 [125,975] 


7.4 [6,9.3] 


0.47 


0.05 


100 


1956 [240, 5628] 


>ams 


112 [103, 121] 


3.05 


0.13 


200 


4240 [4008, 5178] 


>ams 


856 [813, 943] 


8.0 


0.65 


300 


>ams 


>ams 


4305 [4240, 4444] 


14.7 


2.12 



Table 1 

Means and ranges (in square brackets) of computing times in seconds calculated over 
Nq = 100 simulated graphs. For LA and COl there is no variability in the computing times. 
>ams means that the maximum allowed memory size was exceeded. 




Figure 2. Graphic oflog 2 (CPU time) versus p for the EW method. 

4-2. Methods comparison 

We compare our methods with the following ones: 

• the 0-1 conditional independence approach proposed by Wille and Biihlman [23], 
with the decision rule based on the adjusted p- values following the Benjamini- 
Hochberg procedure taking a = 5%. 

• the lasso approach, with the two variants and and or proposed by Mein- 
shausen and Buhlmann [17] , taking a = 5%. 

• the adaptive glasso method proposed by Fan et al. [9]. It works in two 
steps. First, the matrix fl is estimated using the glasso method. Then the 
glasso procedure is run again using weights in the penalty that depend 
on the previous estimate of ft, see Equation (2.5) in [9]. At each step the 
regularization parameter is calculated by K-fold cross-validation. 

These methods will be denoted as WB, MB. and, MB. or and Aglasso. They were 
implemented in R-2.7.2 using the packages lars for the MB methods and the 
package glasso for the last one. 

Assessing the performances of the methods We assess the performances 
of the investigated methods on the basis of Nq x Nx runs where is the 
number of simulated graphs and Nx the number of matrices X simulated for 
each of these graphs. We compare each simulated graph G with the estimated 
graphs G by counting edges that are correctly identified as present or absent, 
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and those that are wrongly identified. We thus estimate the false discovery rate 
(or FDR) defined as the expected proportion of wrongly detected edges among 
edges detected as present, and the power defined as the expected proportion of 
rightly detected edges among edges present in the graph. 

The statistical procedures designed to select graphs have one or several pa- 
rameters that must be tuned. The quality of the final estimation is then affected 
as well by the intrinsic ability of the procedure to select an accurate graph, as 
by the parameter tuning. First, we investigate the first issue by varying the val- 
ues of the tuning parameters and plotting power versus FDR curves. We choose 
p = 100, n = 50 and I s = 3. Then, taking the point of view of a typical user, we 
compare the different procedures with the tuning parameter recommended in the 
literature. We investigate the effect of n by choosing n = 30, 50, 100, 150, keep- 
ing p = 100. We also evaluate the effect of graph sparsity taking I s = 1, 2, 3, 4, 5, 
p = 30 to keep the computer time under reasonable values, and n = 30. 

4-2.1. Power versus FDR curves when p =100 

The number of nodes p and the number of observations n being fixed to p — 100, 
n = 50, for each of the Nq = 20 simulated graphs, we estimated the FDR, the 
power and the MSEP on the basis of Nx = 20 simulations. These calculation 
are done for different values of the tuning parameter. The means over the Nq 
graphs are shown at Figure 3. The standard errors of the means over the Nq 
graphs are smaller than 0.0057 for the FDR, and 0.018 for the power. 

Choice of the family of candidate graphs in our procedure The QE 

method presents good performances: the FDR stays small and the power is high. 
Though it was performed with D = 3, while EW, LA and C01 were performed 
with D = 5, it works the best. The EW method is more powerful than LA and 
C01 if one accepts a FDR greater than 2.5%. 

Comparison with the other methods The procedures LA and C01 behave 
similarly to WB method. The MB . or method presents higher values of the power 
when the FDR is larger than 5%. The MB . and keeps down the FDR but lacks of 
power. The Aglasso method behaves completely in a different way: the curve 
stays under the others as long as the FDR is smaller than 20%. When the 
regularization parameter is chosen by 5-fold cross-validation, the power equals 
59% at the price of a very large FDR equal to 90% (not shown). In the following 
we do not consider anymore the adaptive glasso method, and focus on methods 
that have a good control of the FDR. 

4-2.2. Effect of the number of observations n 

Keepings = 100 and I s = 3, the variations of the FDR and power values versus 
the number of observations, are shown at Figure 4. The QE method is applied 
with D = 3 while EW, LA and C01 are applied with D = 5. For all methods the 
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Power versus FDR 




Figure 3. Graphics of power versus FDR for the case p = 100, n = 50 and I s = 3. The 
marks on the first graphic correspond to different values of the tuning parameter. The curves 
for small FDR values are magnified on the second graphic. The FDR and power values cor- 
responding to the tuning parameter recommended in the literature are superimposed on the 
curves (dashed lines) : K = 2.5 for GGMselect, a = 5% for WB and MB methods. For Aglasso, 
with X chosen by 5-fold cross-validation, the FDR eguals 0.90 and the power eguals 0.59 (not 
shown). 
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power increases with n while the FDR decreases for EW and increases for MB . or, 
LA and C01. QE and EW are the most powerful. When n is small, the QE method 
stays more powerful than EW in spite of a smaller D. 



FDR versus n POWER versus n 

CO 




Figure 4. FDR and power estimated values as a function of n for p = 100 and I s = 3. The 
results are calculated on the basis of Nq = 20 simulated graphs and Nx = 20 runs of matrices 
X for each simulated graph. Our procedures were carried out with K = 2.5. The value of D 
was equal to 3 for the QE method and 5 for the others. For the procedures MB . or, MB . and and 
WB the tuning parameter a was taken equal to 5%. 



4-2.3. Effect of graph sparsity 

We have seen that when p is large, the GGMselect procedures using the graphs 
families QE and EW are powerful and have a good control of the FDR. Never- 
theless, the simulated graphs were sparse, I s = 3, and it may be worthwhile 
testing how the methods perform when the graph sparsity varies. Because the 
performances depend strongly on the simulated graph, the FDR and power are 
estimated on the basis of a large number of simulations: the number of simulated 
graphs Nq equals 50 and the number of simulated matrices X for each graph, 
Nx equals 50. In order to keep reasonable computing times, we choose p = 30. 
The results are shown at Figure 5. The standard errors of the means over the 
Ng graphs are smaller than 0.0055 for the FDR, and 0.025 for the power. 

For all methods the power decreases when I s increases. The FDR values are 
slightly increasing with I s for the EW and MB . or methods. The superiority of QE 
over the others is clear. EW is more powerful then LA, C01, MB and WB methods 
but its FDR is greater. 
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FDR versus sparsity 



POWER versus sparsity 




CD 

d 



ip 



CO 



CM 







QE 






- Q- - EW 






+ LA 






■->«•■ C01 
-A- MB. and 
-- MB. or 
-~e— WB 








\ X v v 

\ V 






\ >i 
\ V_ 
\ m 


















1 2 


3 


4 5 



Figure 5 . Graphs of FDR and power estimated values versus the graph sparsity I s , for p = 30 
and n = 30 . The results are calculated on the basis of Nq = 50 simulated graphs and Nx = 50 
runs of matrices X for each simulated graph. Our procedures were carried out with K = 2.5 
and D = 5. For the procedures MB. or, MB. and and WB the tuning parameter a was taken equal 
to 5%. 



4-2.4- GGMselect : mixing the graphs families 

Our procedure allows to mix several graphs families. It may happen that some 
graphs, or type of graphs, are known to be good candidates for modelling the 
observed data set. In that case, they can be considered in the procedure, and thus 
compete with few or Gqe- This can be done with the function selectMyFam of 
the package GGMselect. 

Considering the results of our simulation study, we could ask if mixing Gla or 
Gcoi with Gew would not give a better control of the FDR than EW while keeping 
high values of the power. To answer this question we carried out simulation 
studies taking Gmix = Gcoi^Gla^Gew a s the family of graphs. In all considered 
cases for p, n, I s , the FDR and power values based on Gmix arc similar to those 
based on Gew- This result can be explained by studying the behavior of the 
MSEP estimated by averaging the quantities \\T, 1 I 2 {6^6)\\ 2 over the Nq x Nx 
runs. The results are given at Figure 6. One can see that the smallest values of 
the MSEP are obtained for QE, then EW. Moreover, the MSEP decreases when the 
power increases, while it does not show any particular tendency when the FDR 
varies. Considering these tendencies together with the fact that our procedure 
aims at minimizing the MSEP, we can understand why we do not improve the 
performances of EW by considering Gmix- 
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Figure 6 . Values of the MSEP for the simulation results given at Figure 5. The first graphic 
on the left presents the ratio of the MSEP over the MSEP of the QE method. The two others 
present the MSEP versus the FDR and the power. 



4-3. Summary 

We recommend to use the QE method if the calculation of Gx,and and Gk,oi 
is possible. Next, working out the family Gqe can always be done using some 
suitable algorithms if necessary (as a stepwise procedure for example). When p 
is large, QE can be used for small values of D (D = 3 or even D = 2). It may 
perform better than all the others when n is small. The procedure based on (Jew 
can be used for large p: the gain in power over LA, C01, MB and WB methods is 
significant, but the FDR is slightly greater. The LA and C01 methods are running 
very quickly, keep the FDR under control and arc slightly more powerful than 
WB and MB . and. 



5. Proofs 

5.1. Proof of Theorem 3.1 

We write Gd for the family of all the graph with nodes in T and degree less than 
D. We remind the reader that for any graph G € Gd we have noted @g the 
space of p x p matrices 9 such that 9 a .b is non zero if and only if there is en edge 
between a and b in G. We also set 8_D max = Uaeg n X @G- We set A = (1 — y/y) 2 
and introduce the event 

B = |Al|S 1 / 2 A|| pxp < -±=\\XA\\ nXp < A-^IE^AHpxp, for all A e 9 + e A _ J 
On this event we can control the L 2 -loss of 9 by the empirical loss since 

l|£ 1/2 (<? - 0)\\ 2 pxp i* < —WMO o)\\l xp i M . (13) 
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Furthermore, according to Lemma 1 in [11], we have P(B C ) < 2e" n( v^~ 7)2/2 
when Condition (9) is met. To bound the risk of the procedure, we consider 
apart the events B and B c . 



5.1.1. Bound on E 



\W 2 0-e)\\ 2 pxp i t 



We have X = X# + e, where e is a n x p matrix distributed as follows: for each 
a G r, the column e a is independent of X_ a and is distributed according to the 
Gaussian law Af(0,a^I n ), with cr 2 = l/0 0)0 . For any G e Qd, we write hence- 
forth 9 G for the orthogonal projection of 9 on @c according to the Euclidean 
norm US 1 / 2 • || pxp on W xp . Similarly, we write 9 G for the orthogonal projection 
of ^ on 9g according to the (random) Euclidean norm ||X • \\ nX p on l pxp . For 
any G € Qd, we write d a (G) for the degree of the node a in G and introduce 
the positive quantity 

R{G) = f) ( 1 + P ™} d d a a f^ ) (||X(*„ - 9 G )f + 2| < Xtf a - X^, 6, > |) 



pen(d a (G))n ,, 2 



^ n-d a {G) 

a— 1 v ' 



where |.| and < ., . > denote the canonical norm and scalar product on W 1 . 
Following the same lines as in the beginning of the proof of Theorem 2 in Baraud 
et al. [2], we get for any G* in Q 



1 



X(0 - 9)\\f lxp l B < R{G*)1 M + A(G)1 B (14) 



with 



A(G) = |> 2 (Vc/ neo(a) - P ^ ( ( 5 ) ) Ke G(o) ) + 

where t/ neG (a) an< i ^ne G (a) are two independent x 2 random variables with d a (G)+ 
1 and n — d a (G) — 1 degrees of freedom. 

We note that under Condition (9) there exists some constant c( 7 ) depending 
on 7 only, such that 

pen(d) < c(i)K{d + 1) Iog(p), for all d G {0, . . . , D max } , 

see Proposition 4 in Baraud e< [2]. In particular, we have for any G € £7zj 

peil(d a (G)) < C( 7 )/Y (Anax + 1) log(p) 



< 4^70(7) = Ly t 



n-d a (G) ~ n/2 
Using this bound together with 

|2 < Xtf - X9 G , e a >\< ||X(0 O - 0" a G )|| 2 + 
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where for any G S Q and a G {1, . . . , p}, the random variable 
£ 0)G =< X(# a - ^),e > /(a o ||X(0 a - 9 G )\\) 
is standard Gaussian, we obtain 

R(G) < (1 + L,, K ) jr ( 2 ||X(fl„ - gf)H 2 + <&2, G ) + Kll 2 

< 2(1 + L 7> jc)||X(0 - e G )f nXp + (4 + L 7iJf ) £ pcn(d a (G))a 2 a + r{Q D ) 

a=l 

where 

t{Gd) = ^> 2 ( (! + L 7,a-) E K,G-pen(4(G))] + + L 7)X [||e a ||>a - 3n/2] . 

a=l V GeS 

Furthermore, we have ||X(0 — G )\\ nxp < ||X(^ — 9 G )\\ nxp and on the event B 
we also have ||X(0 - 9 G )\\ 2 nxp < rzA" 2 1| S 1 / 2 (6» - 9 G )\\ 2 pxp so that on B 

R(G) < L' y>K L\- 2 \\^ 2 (6 - 9 G )\\ 2 pXp + J2 pen(d a (G))al\ + r(£fo), 

with L' K = max(2 + 2L 7i x, 4 + L 7) if)- Putting this bound together with (13) 
and(14), we obtain 



K 

n\*{K- 1) I G T e > 



< L; A . inf^ HE 1 /^-^')^ + j- pen(da(G! . )) 2 a 
G*eS \ 

(r(fe)+A(G) 

We note that 



l E(r(Q D )) < £ ^f(l + ^)(3 + log(p)) 



and we get from the proof of Theorem 1 in [11] that 

ra _1 E(A(G)) < n _1 E ( sup A(G) ] 

\GGSd / 
< lf^(l + log(p)). 
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Since pen(d) < c(j)K(d+ l)log(p), the latter bounds enforce the existence of 
constants L~ h K and L'^ K depending on 7 and K only, such that 



E 



|£ 1/a (0- 61)11^1, 



G*ea \ 
< L' K log(p) f E 



inf [\\^{6-9 G ')\\l xp - 



inf_MSEP(6» G .) 

G*ee 



]T(log(p)Vpcn[d Q (G*)])^ 

a—1 
P rr 2 \ 



Finally, we note that YZ=i a ll n = MSEP(I). 



5.1.2. Bound on E 



|E 1 / 2 (6» - 8)\\l x l R o 



We now prove the bound E 



||EV2(0 - 6)\\ 2 pxp l B o < Ln 3 tr(E)v^P(i=). We have 



E 



|£ 1/2 (^ - 9)\\ 2 pxp Uc] = ^E [HE 1 / 2 ^ - a )|| a l,. 



a=l 



and wc will upper bound each of the p terms in this sum. Let a be any node in T. 
Given a graph G, the vector [6c]a depends on G only through the neighborhood 
nee (a) of a in G. Henceforth, we write 9 ne ^i a ) for 9 a in order to emphasize 

^ G 

this dependency By definition # n( w a ) is the least-squares estimator of 9 a with 

G 

support included in ne^a). Let us apply the same arguments as in the proof of 
Lemma 7.12 in [21]. By Cauchy-Schwarz inequality, we have 



E 



||s 1/2 (0 Q - e a )\\ 2 i w < yp(F) Je ||£V2(0 : 



G 



(15) 



Let Nd{o) be the set made of all the subsets of T \ {a} whose size are smaller 
than 7n/[2(l.l + ^log(p)) 2 ]. By Condition (9), it holds that the estimated 
neighborhood ncg(a) belongs to A/z^a), so Holder inequality gives 



E 



G 



E > 

ne(a)SAAr>(a) 

* E > 

ne(a)GWi}(a) 



E E 

1/u 



nc^(a) = ne(a) 



neg(a) = ne(a) 



(a)=ne(a)l|£ (#ne(a) — #a)|| 
l/v 



1/u 



sup E 

nc(a)£A/i} (a) 



s(a) 



%)\ 



it' 



l/v 
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where v = \_^\, and u = (we remind the reader that n is larger than 8). In 
particular, we have the crude bound 



E 



\\^ 2 (0 ne ^ a )-0a)\ 

G J 

< [C&id(Af D (a))} 1/2v 



sup E 

ne(a)eW D (a) 



\\^ 1/2 (9 nc{a ^0 a )\\ 



4 u 



l/2v 



since the sum is maximum when every P[ne(a) = neg(a)] equals [Card(A/i)(a))] 1 

We first bound the term [C&rd(Af D (a))] 1/2v . The size of the largest subset in 
A/r>(a) is smaller than n/(2 log(p)), so the cardinality of A/r>(a) is smaller than 
p o . Since n is larger than 8, we get 

[Card(A^(a))] 1/2 " < cxp - 

which ensures the bound 

1 < L 



4K8J 



E 



||£V2(0 : 



G 



sup E 

ne(a)GWr)(a) 



\\^ 1/2 (0 ac (a)-e a )\\ l 



l/2v 



(16) 

To conclude, we need to upper bound this suprcmum. Given a subset ne(a) 
in Afo(a), we define # no (a) as the vector in R p such that S 1 / 2 6' nc ( a ) is the or- 
thogonal projection of E 1 / 2 ^ onto the linear span {E 1 / 2 /? : supp(/3) C ne(a)}. 
Pythagorean inequality gives 



'nc(a) - u a)\\ — ||^ ^nc(a) ~ u a ) | 1" 

and we obtain from Minkowski's inequality that 

1/(2*0 



ne(a) 



e(a), 



E 



\^ 1/2 (0nc(a)-e a )\\ 



4u 



< ||£ 1/2 (# n c(a) - 0a)f +E \\\^ 2 (0 nc{a} - ne(a) )|| 



l/(2v) 



The first term is smaller than V&r(X a ). In order to bound the second term, we 
use the following lemma which rephrases Proposition 7.8 in [21]. 

Lemma 5.1. For any neighborhood nc(a) and any r > 2 such that n — |ne(a)| — 
2r + 1 > 0, 



E 



l/r 



S 1/2 (0 no (a) ~ nc(a) )\\ 2r < Lr\ne(a)\nVar(X a ) 



Since v is smaller than n/8 and since |ne(a)| is smaller than n/2, it follows 
that for any model ne(a) G Afp (a), n — |ne(a)| — Av + 1 is positive and 



E 



IE 1 / 2 * 



e(a) 



4r 



l/(2„) 



< Var(X a ) [1 + Ln 2 v] < Ln 3 E a , a 
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Gathering this last upper bound with (15) and (16), we get that 
\^ 2 (9- 



E 



5.1.3. Conclusion 



\pxp 



< Ln 3 tr(E) v / P(B c ). 



Finally, putting together the bound on E[||E 1//2 (# — 6>)|| 2 1b], the bound on 
E[||S 1 / 2 ((9 - <9)|| 2 l B c], and the bound P(B C ) < 2 P e~ n ^-^ 2 / 2 , we obtain 



MSEP(0) <L Kn \og( P ) | 
with R n < J L7i 3 tr(S)e-"(v / 7-/)'V4_ 



inf fMSEP(0 G )) 



GeG 



5.2. Proof of Corollary 3.2 

The result is proved analogously except that we replace the event B by 

b' = bu{g e e^} . 

Hence, the residual term now satisfies 
Rn < Ln 3 tr(E)^P(B c ) 



< Ln 3 tr(E) e -" ( ^ 7) " /4 + y/Z(a)t 



5.3. Proof of Theorem 3.3 

In this proof, the notations o(l), O(l) respectively refer to sequences that con- 
verge to or stay bounded when n goes to infinity. These sequences may depend 
on K, s, s' but do not depend on G„, on the covariance E, or a particular subset 
S C T. The technical lemmas are postponed to Section 5.4. In the sequel, we 
omit the dependency of p and E on n for the sake of clarity. First, observe that 
the result is trivial if n/log(p) 2 < 1, because the assumptions imply that Gs 
is the empty graph whereas the family Q contains at most the empty graph. In 
the sequel, we assume that n/ log(p) 2 > 1. 

Let us set D max = n/ log(p) 2 . We shall prove that for some L > 0, 



P (Crit(G s ) 



inf 

G' , dog(G')<-D n 



Crit(G') > 1 - Lp 



-1/2 



(17) 



for n larger than tiq(K, s, s'). Since G minimizes the criterion Crit(.) on the 
family Q , this will imply the result of the theorem. 
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In fact, we shall prove a slightly stronger result than (17). Let a be a node 
in r and let ne(a) be a subset of T \ {a}. As defined in Section 5.1.2, # no ( a ) is 
the least-squares estimator of 9 a whose support is included in ne(ct). 



9 nc(a) =arg inf ||X o -X0' o | 

&' a . supp(p^)Cnc(a) 



If G is a graph such that the neighborhood ne G (a) equals ne(a), then # ne ( a ) 
[0g]o,- We then define the partial criterion Crit(a, ne(a)) by 



Crit(a,ne(a)) = |jX a - X6» ne(a) ||„ 



pen(|ne(a)|) 
n — |ne(a)| 



Observe that for any graph G, Crit(G) = J2a=i Crit(o, ne G (a))- We note 
he(a) the set that minimizes the criterion Crit(a, .) among all subsets of size 
smaller than D ma , x . 

he(a) = arg inf Crit(a, ne(a)) . 

nc(a)eAf Dmlx (a) 

If for all nodes a € T, the selected set ne(a) equals ne<3 E (a), then Gy. minimizes 
the criterion Crit(.) over all graphs of degree smaller than -D max - Consequently, 
the property (17) is satisfied if for any node a € T, it holds that 

P [ne(a) = nc Gs (a)] > 1 - 7p,; 3/2 , (18) 

for n larger than some no[K, s, s']. 

Let us fix some node a £ T. We prove the lower bound (18) in two steps: 

1. With high probability, the estimated neighborhood ne(a) does not strictly 
contain the true one ne<3 s (a). 

P [25(a) 2nc Gs (a)] < p~ 3 / 2 , (19) 

for n larger than some no[K, s,s']. 

2. With high probability, the estimated neighborhood he(a) contains the true 
one neG E (a). 

P[ne(a) ^ne GE (a)] <6p" 3 / 2 , (20) 

for n larger than some no[K, s,s']. 
The remaining part of the proof is deserved to (19) and (20). 

Let us recall some notations and let us introduce some other ones. The com- 
ponent X a decomposes as 



X a — X9 a + e, 
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where e a follows a centered normal distribution with variance f2~ \ — V&r(X a \X- a ). 
The variables e a are independent of X- a . Given a set S C T, lis stands for the 
projection of R™ into the space generated by (X a ) oe s, whereas Ug denotes the 
projection along the space generated by (X a ) ae s. The notation (., .)„ refers to 
the empirical inner product associated with the norm \\.\\ n . For any neighbor- 
hood ne(a) CT \ {a} such that |ne(a)| < D maK , let us define A(ne(a), nea s (a)) 
by 

A(ne(a), neG s (a)) = Crit(a, nc(a)) — Crit(a, neG s (a)) ■ 
5.3.1. Bound on P (ne(a) 2 ne G s («)) 

We shall upper bound the probability that A(ne(a), neG s (a)) is negative for 
at least one of the neighborhoods nc(a) G A/£> max (a) such that ne(a) strictly 
contains ne<3 s (a)- For such a set ne(a), A(ne(a), neG E (a)) decomposes as (see 
e.g. Lemma 7.1 in [21]) 



A(ne(a),ne Gs (a)) 

= l|nno(a) e a|ln 



pcn(|nc(a)|) 
n — |ne(a)| 



in 1 - e II 2 

l 11 iiOG 5; (a) fc a|ln 



in 



neG E (a) J -nnc(a) e a 



1 



pen(|ne Gs (a)|) 



|n ne ( a )e a ||„ 



n - |ne GE (a) 
pen(|ne(a)|) pen(|neG s (a)|) 



n— |ne(a)| n— |neG s (a)| 
Hence, A(m,neG s (a)) > if 



pen(|ne GE (a)|) 
n - |nc GE (a)| 



in, 



,(a)^nno(a)ea|| 2 l /(|ne(a) \ ne Gi; (a) 



in 



: (a) c °l 



i/(n-|ne(o)|) 



< 



pen(|ne(a)|) - pcn(|nc Gs (q)|) 
|nc(a) \ne Gs (a)| 



pen(|ne Gs (a)|) 



n - |ne Gs (a) 



(21) 



To conclude, it remains to prove that the bound (21) holds with high probabil- 
ity. Let us call Ai the right expression of (21) and let us derive a lower bound of 
Ai. Afterwards, we shall upper bound with high probability the left expression 
of (21). 

Upper bound of A\. We first upper bound the penalty function. 

Lemma 5.2. Let d\ > c?2 be two positive integers such that d\ < e~ 3 / 2 (p — 1). 
We have 



pen(di) — pen(e?2) > 2K(d\ — d-i) log 



p-di 
di 



(22) 
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A proof of this lemma is provided in Section 5.4. By Proposition 4 in [2], the 
penalty pen(|ne Gs (a)|) satisfies 

pen(|ne Gs (a)|) < LK l Gs{ n log ' 



|ne Gs (a) 



where L is some numerical constant. This last term converges towards as n 
goes to infinity since |ne Gl ,(a)| < (n s / \og(p)) A (n/log(p) 2 ) (Assumption 2). 
Gathering this upper bound with Lemma 5.2, we get 

l 0t r ( P-\ n < a )\ \ , s. 

* > ™ 1 lS!n >2K 10,(^(1 -0(1)). (23) 



pcn(|nc G (q)|) - 6 \j ne (a)| 

n-|nc G „ (a) | 



Lower bound of the left part of (21). The random variables involved in 
this expression follow a Fisher distribution with |ne(a)\ne Gs (a)| and n — |ne(a)| 
degrees of freedom. To conclude, we only need to compare the quantile of such 
a variable with the bound (23). Let u £ (0, 1) and let F^^lu) denote the 1 — it 
quantile of a Fisher random variable with D and N degrees of freedom. By 
Lemma 1 in [3], it holds that 



DF^ N (u) < D + 2 A /Z>(l 



D 
> 

' N 



log 



D\ N 
1 + 2— — 
N ' 



exp ( — log ( - ) ] - 1 



Let us set u to 



3 3/2 e |„o(a)\„o Gs (a)\(P~ l nC G S (a)| - 1 

V|ne(a) \ ne GE (a)| 

Since we consider the case n/log(p) 2 > 1 and p > n, the term 4/(n 
|ne(et)|) log(l/w) goes to with n (uniformly w.r.t. ne(a)). 



A 2 = F, \ , ,: i , s,(u) < 1+24 

|nc(a)\noG E (a)\ ,n-\ne(a)\ \ > — \ 



1 



|ne(a) \ ne Gs (a) 
2 



(l + o(l)) log - 
u 



|ne(a) \ ne Gs (a)| 



(l + o(l)) log 



The term log(l/w)/|ne(a) \ ne Gs (a)| goes to infinity with n (uniformly w.r.t. 
ne(a)). Hence, we get 



An 



< 1 



|ne(a) \ne Gs (a)| 



log 



(l + o(l)) 
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Applying the classical inequality log (,) < k\og{el/k), we obtain 
log(p) 



A 2 < 



2 log 



|ne(a) \ne Gs (a)| \ |ne(o) \ nc Gs (a)| 

P 

|ne(a) \ne Gs (a)| 



(l + o(l)) 



< 51og( , , * ) (i + (i)). (24) 



Conclusion. Let us compare the lower bound (23) of A\ with the upper 
bound (24) of A 2 . 

• Let us first assume that |ne(a)| < 2|ne Gs (a)|. Then, we have 

A\ > 2K\og (-. ^—r) (1 - o(l)) > 2K(1 - s) log(p) (1 - o(l)) , 



|ne Gs (a 

since |ne Gl .(a)| < n s /log(p) < p s . In particular, 

A 2 < 5 log ( P ---) (1 + o(l)) < A ± , 

for n large enough since wc assume that 2A'(1 — s) > 5. 
• If |ne(a)| > 2|ne Gs (a)|, we also have 

A a <51og( r -^i > )(l + °(l))<A 1 , 
V|nc(a)|7 

for n large enough since we assume that 2K > 5. 
It follows from Ineq. (21) and the definition of A\ and A 2 that 



[A(ne(a),nC GE (a)) < 0] < \ p 3/2 e |ne(a)\ne Gs (a)| 



-1 



P- K Gs (a)| 
|ne(a) \ ne GE (a)| 

for n larger than some positive constant that may depend on AT, s, but does not 
depend on ne(a). Applying this bound to any neighborhood ne(et) that strictly 
contains ne Gs (a) yields Statement (19): 

P[he(a) 2nc Gs (a)] < p~ 3/2 , 

for n large enough. 

5.3.2. Bound on P (he(a) ^ ne GE (a)) 

Again, wc shall prove that A[ne(a), ne GE (a)] is positive for ne(a) ^ ne Gs (a) 
with overwhelming probability. We recall that # no ( a ) is the vector in W such 
that S 1 / 2 6 ne ( a ,) is the orthogonal projection of Y}/ 2 8 a onto the linear span 
{S 1 ^ • supp(/3 ) c ne ( a )}. Moreover, 1 1 ^ 1 / 2 (6» a — 6» nc(a) ) 1 1 2 = Var(A a |A no(a) ) - 
Var(A a |A_ a ) (see e.g. Lemma 7.1 in [21]). 
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Then, A(ne(a), ne Gs (a)) decomposes as 

A(ne(a), nc Gs (a)) = ILf e(a) [e a + X{9 a - 6 ne(a) )] 



n ne G (a) £ a 



Let k = 6/7 and let us define 



pen(|ne Gs (a)|) 



n - |ne Gs (a)| 



pen(|ne(a)|) 
n — |ne(a)| 



■(») 



n^ e( a)X( 



#ne(a)) 



in 1 , ,x(6 

nc(a) ^ 



'ne(a) J 1 1 n 



n- 1 e 
> J " l ne(a) t a 



in 



ne(a) e a|| n 



We recall that (.,.} n is the inner product associated to the norm 
quantity A(ne(a), ne Gs (a)) is positive if 



The 



(l-«)imL a) X 



e(a))||n > -£ne(a) 



1 



pen(|ne(a)|) 



n — |ne(a)| 



pen(|ne Gs (a)|) pen(|ne(a)|) 



n — |ne Gs (a)| n — |ne(a) 



(25) 



We respectively call A3 and A4 the right and the left terms of the inequality. 
To conclude, we need to control the deviations of these terms in order to prove 
that (25) holds with high probability. 



Upper Bound of A3 . On an event A of probability larger than 1 — 2p 



-3/2 



the random variable 



satisfies (see Lemma 1 in [16]) 



1-2 /31og(p) < 
2n 



Var(X a |X_ a ) 



< 1 + 2A/ rl]^M + 3^M 
2n n 



Let us bound the other random variables involved in (25). As explained in 
the proof of Th.3.1 in [21], the random variables ||n^ a jX(0 — # n e(a))||n an d 
E nc ( a ) follow distributions of linear combinations of \ 2 random variables. We 
apply again Lemma 1 in [16] . On a event A ne ( a ) of probability larger than 

1 - 2p- 3 / 2 e-l n °( Q )l( |i f c ^ 1 )| )~ 1 , it holds that 



ll n nc(q) X ( 6> ~ g nc(q))lln 

Vax(X a \X ne(a) ) -Var(X a \X_ a ] 



> 1 



|ne(a)| 



-21 



■log(p) + |ne(o)|[2 + log(p-l)] 
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and 



E nc (a) |nc(a)| + k 1 



Var(X a |X_ a ) 



+ -J(\ne(a)\+K- 2 ) 



2k 



ne(a)| 2 + log 



ne (a) | 2 + log 



p-1 
\ne(a)\ 



P-1 
|ne(a)| 

! iog(p) 



Wc derive that 



■■(a) 



Var(X a |X_ a ) 



< 



2«T 



|ne(a)| log 



P-1 
|ne(a)| 

y/6|nc(a) | log(p) k' 1 
n n 



■ log(p) 



^ log(p) 



CASE 1: ne(a) is non empty. 



E 



ne(a) 



Var(X a |X_ a ) 



< K 



i 2|ne(a)|log(^)+31og(p) 



(l + o(l)) 



Let us upper bound the terms involving pcn(|nc(a)|) in (25) on the event 

An A nc(a) . 



E„ 



>(a) 



pen(|ne(a)|) 



n — |ne(a)| 



2 pen(|ne(a)|) 
a|ln n- |ne(a)| 

P-1 



/Var(X a |X_ a ) 
31og(p)) (l + o(l)) 



< 2|ne(a)|log 

n \ \ \iie{a)\ 

-— |ne(a)|logf^l) 



This last quantity is negative for n large enough since K > 3. 
• CASE 2: ne(a) is empty. We get the upper bound 



E, 



nc(a) 



1 pen(|ne(a)|) 
n — |ne(a)| 



pcn(|ne(a)|) 
1 n - |ne(a)| 



< 



n 1 + 3 log(p) 



Var(X a |X_ Q ) 



< {Z + K- x )n s - l Vax{X a \X_ a ) , 

Indeed, log(p) has to be smaller than n 8 . If this is not the case, then 
neG s (a) should be empty and ne(a) cannot satisfy neG s (a) ne(a). 
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We conclude that on the event ADA 



ne(a) ) 



A 3 < (3 + K-V-'VarpfalX-a) + \\e a \\l 2^£sMl , 

n- |ne Gs (a)| 

for n large enough. Let us upper bound the penalty term as done in the upper 
bound of A\. 



pen(ne GE (a)) < LK^^\o % 



Since |ne Gs (a)| is assumed to be smaller than 
bounded as follows 



ne Gs (a) 

, the term A3 is upper 



log(p) 



A3 < (K + l)n s - 1 Var(X a |X_ a )0(l) 



(26) 



for n large enough. 



Lower Bound of A4. Let us lower bound the left term A\ in (25) on the 



event A n . 



A± > (l-o(l))(l-/s) [Vax(X a \X ne(A) )-V&T{X a \X- a )] 

> fl-o(l))(l-«) min (9 ab ) 2 min YaT [ X b\ X -b) 
~ K WA Vr\{a} V a ' W &,c G r\{a} Var(X c |X_ c ) 

> (1 - k)(1 - o(l))n s '- 1 Var(X a |X_ a ) . 



Var(X a |X_ a ) 



Thanks to the last bound and (26) and since s' is larger than s, ^3 < A4 on 
the event An A nc( - a ) and for n large enough (not depending on ne(a)). Hence, for 
n large enough the inequality (25) holds simultaneously for all neighborhoods 
ne(a) such that ne Gs (a) ^ ne(a) with probability larger than 1 — 2p~ 3 / 2 — 
2(e/(e - l))p~ 3/2 . We conclude that 



'(he(a) ^ne GE (a)) <6p- 3 / 2 



for n large enough. 



5-4- Lemmas 

Lemma 5.3. For any positive integer d < e~ 3 / 2 (p — 1) 

'p-l 



EDKhi 



— d— 1, 



(d+l)< 



> rf+ 1 



Lemma 5.4. For anj/ positive number x and any positive integers d and N , 
EDKhi(ti, N, x) is an increasing function with respect to d and a decreasing 
function with respect to N . 
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Lemma 5.5. For any integer d > 2, the function 

E[(X d -xty) + ] 
E[(X 2 -x)+] 
is increasing with respect to x as soon as x > d. 

Proof of Lemma 5.2. Let us write L\ = log ^( p ~ 1 )^j and L 2 = log 
Lemma 5.4 ensures that 



p-V 
d2 > 



EDKhi (di + l,n- di - l,e~ Ll ) > EDKhi (d 2 + l,n-d 2 - 1, 



(27) 



Let X\ > X2 be two positive numbers larger than some integer d 2 + 1- By 
Lemma 5.5, it holds that 



DKhi(d2 + l,»-d2-l,aji) E[(X 2 -xi)+] 



-(xi-a2)/2 



DKhi(d 2 + l,n-d 2 -l,a: 2 ) " E[(X 2 -x 2 ) + ] 

By Lemma 5.3, EDKhi(d 2 + 1, n — d 2 — 1, e~ i2 ) is larger than d 2 + 1- Setting 
xi = EDKhi(d 2 + l,n-d 2 - l,e" Ll ) and z 2 = EDKhi(d 2 + l,n-d 2 - l,e~ L2 ), 
we obtain 



EDKhi(d 2 + l,n-d 2 - l.e"^ 1 ) - EDKhi(d 2 + l,n- d 2 - l.e"^ 2 ) > 2(Li - L 2 ) , (28) 

for d 2 > 1. Gathering the bounds (27), (28) with the definition (3) of the penalty 
enables to conclude 



pen(di) — pen(d 2 ) > 2K(d\ — d 2 ) log 



P~ di 
di 



(29) 
□ 



Proof of Lemma 5. 3. We write henceforth Xd and X' N for two independent x 2 
random variables with d and N degrees of freedom. Applying Jensen inequality 
we obtain that for any x > and any d > 2 



d x DKhi(d, AT, x) 



> 



E [(X d - a;). 



> E [(X 2 - = 2 
Setting x = EDKhi(d, N, e~ L ) with L > 0, we obtain 

EDKhi(d, A, e" L ) > 2L - 2 log(d), for d > 2. 

It follows that 



EDKhi 



d + 1, n 



1. 



p-1 
d 



(d+l) ! 



> 2 log 



> 2d log 



p-1 
d 

ed 



which is larger than d if d < e 3 / 2 (p — 1). 



□ 
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Proof of Lemma 5.4- By definition (3) of the function EDKhi, we only have to 
prove that DKhi(d, N, x) is increasing with respect to d and decreasing with 
respect to n. 

Conditioning on Xn (resp. Xj) it suffices to prove the two following facts: 
FACT 1: Let d be a positive integer. For any positive number x, 

dE [(X d+1 - x)+] >(d + 1)E [(X d - x)+] . 

FACT 2: Let TV be a positive integer. For any positive numbers x and x' , 



E 



N 



X 



> E 



: N + 1 



Proof of FACT 1. Let (Zi, . . . , Za+i) be d + 1 independent x 2 random 
variables with 1 degree of freedom. Let Y = 2i=i an d f° r an y * ^ {1, . . .d + 
1}, let rW be the sum YW = X^j^i ^j- The variable V follows a \ 2 distribution 
with d+1 degrees of freedom, while the variables Y'*' follow x 2 distribution 
with d degrees of freedom. It holds that 



d+1 



d(Y-x) + > Yj ( yW ~ x ) • 



(30) 



Indeed, if all the variables YW are larger than x, one observes that d(Y — x), — 

d(J2i=i %i ~ dx) while the second term equals dJ2i=i %i ~ d(d + i)x. If some 
of the variables Y^ l > are smaller than x, it is sufficient to note that the variables 

yW 

are smaller than Y. We prove FACT 1 by integrating the inequality (30). 
Proof of FACT 2. It is sufficient to prove that for any positive number x, 



E 



N 



> 



Ajy + i 
N + 1 



Observe that E 
prove that 



=(*-1)+e[(^-z) + 



Hence, it remains to 



(N + 1)E [(Xjv - Nx) + ] > NE [(X N+1 — (N + l)x) + ] 



(31) 



As in the proof of FACT 1, let (Z\, . . . , Zd+i) be d+ 1 independent x 2 random 



variables with 1 degree of freedom. Let Y = X)i=i an d f° r am7 * ^ {1, . . . d + 
1}, let Y® be the sum yW = Z<. It holds that 

2V+1 

^fy»-JVaA > AT (Y - (A + l)x) + . (32) 
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This bound is trivial i£Y< (N + l)x. If Y is larger than (N + l)x, then the 
second term equals (N + 1) Y^i^O^^ ~~ Nx), which is clearly smaller than 
the first term. Integrating the bound (32) enables to prove (31) and then FACT 
2. □ 

Proof of Lemma 5. 5. We show that the derivate of the function 

E [(Xd — x^jf)+] /E [(X2 — x)+] in non-negative for any x > d. Thus, we have 

to prove the following inequality: 



(X d 



' N 



E 



^ E[(X 2 -x) + ] 2 



Hence, we aim at proving that the function 

Xn 



(X 2 > x) 



*(x) = E 



X d - x- 



N 



Xn 1 



is positive. Observe that 'J(x) converges to when x goes to infinity. Let us 
respectively note fx d (t) and f x N (t) the densities of X d and X^/N. 



*'(x) 



fx d (u)du 



f *N {t)dt 



Integrating by part the density of a x 2 distribution, we get the lower bound 



(l/oW 2 



d/2-l 



*'(x) < 



< 



< 



< 



< 



{1/2Y/ 2 - 1 

T{d/2) 
(1/2) 



tixty'^e-* 1 ' 2 ^- l)/x^(t)* 



(JV+d)/2-l 



N N/2 x d/2 



r(d/2)r(Jv/2) Jt=0 

2N N/2 x d/2-l 

T(d/2)r(7V/2)(x + N)( d + N )/ 2 

2N N/2 x d/2-l 

T(d/2)T(N/2)(x + 7V)( d + w )/ 2 

2N N/2 x d/2-l r (d±JV) 

T(d/2)r(iV/2)(x + N)( d + N )/ 2 



t^it-l^^e-^+^dt 



(d+N)/2-l 



t 



2r(^ 



2t 



x + N 
d + N 



x + N 
, d+. 



1 e-*dt 



x + N 



- 1 



<0 , 



since x > d. Hence, VP is decreasing to for x larger than d and it is therefore 
non-negative. □ 
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